Basketball Analytics with R

Presenter: Addison McGhee
Created By: Mathew Chandy

Outline

  • What is Basketball Analytics?
  • A Brief History of NBA Data Collection and Statistics
  • Acquiring Basketball Data Using R
  • Creating Visualizations (Shot Charts, Assist Networks)
  • Simulating March Madness Brackets and Improving Predictions
  • Conclusion and Further Resources

What is Basketball Analytics?

  • Basketball Analytics has been heavily influenced by “Sabermetrics”, or the use of statistical modeling in baseball
  • In both sports, the goal is to use data to improve player evaluation, lineup optimization, and matchup analysis, among other things
  • Fortunately, data collection in basketball has become more streamlined over the years, making many analytical efforts possible

Access to Data is the Key!

 

“Every revolution in science has been driven by one and only one thing: access to data.”

  • John Quackenbush, Renowned Scientist

A Brief History of NBA Data Collection

  • 1946-1947: Basic Offensive Scoring Tracked (“Box Score”)
  • 1950-1951: Shot Charts Created by Hand
  • 1973-1974: Rebounds, Steals, and Blocks Tracked
  • 1979-1980: 3pt Shot Introduced; Film Used in Practices
  • 2000: Shot Charts Generated with Excel
  • 2004: Shot Distance Tracking Introduced (Synergy Sports)
  • 2013: Optical Player Tracking via SportVU Cameras
  • Present: Advanced Tracking (Second Spectrum, Hawk-Eye)

Thinking Outside the Box-Score: The Four Factors

 

Effective Field Goals, Turnovers, Rebound %, Free Throws

T = Team, O = Opponent

\(eFG\% = \frac{ (2PM)_T + 1.5 \times (3PM)_T }{ (2PA)_T + (3PA)_T}\)

\(TO = \frac{TOV_T}{POSS_T}\)

\(REB\% = \frac{OREB_T}{OREB_T + DREB_O}\)

\(FT\) Rate \(= \frac{FTM_T}{(2PA)_T + (3PA)_T}\)

The Four Factors by Kubatko, J., Oliver, D., Pelton, K., and Rosenbaum, D. T. (2007).
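The formulas above can be sketched as a small R function. This is only an illustration: the function name, the argument names, and the sample box-score totals are all made up, and the function simply mirrors the four ratios as written on this slide.

```r
# Four Factors from team (T) and opponent (O) box-score totals.
# Argument names and sample totals are hypothetical.
four_factors <- function(p2m, p2a, p3m, p3a, ftm, tov, poss, oreb, opp_dreb) {
  fga <- p2a + p3a  # total field goal attempts
  c(
    eFG     = (p2m + 1.5 * p3m) / fga,   # effective field goal percentage
    TO      = tov / poss,                # turnovers per possession
    REB     = oreb / (oreb + opp_dreb),  # offensive rebound percentage
    FT_rate = ftm / fga                  # free throws made per field goal attempt
  )
}

# Made-up totals for a single game
ff <- four_factors(p2m = 30, p2a = 55, p3m = 12, p3a = 35,
                   ftm = 18, tov = 13, poss = 100,
                   oreb = 10, opp_dreb = 35)
round(ff, 3)
```

Swapping the team and opponent totals gives the opponent's side of each factor.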

Acquiring Data

Basketball Reference

Loading Data from Basketball Reference

R Package for Manipulating Data: The Tidyverse

install.packages("tidyverse", repos = "http://cran.us.r-project.org")

R Package for Acquiring Data

if (!requireNamespace('devtools', quietly = TRUE)){
  install.packages('devtools')
}
devtools::install_github("sportsdataverse/sportsdataverse-R")

R Package for Visualizing + Analyzing Data

devtools::install_github("sndmrc/BasketballAnalyzeR")

Coding Tutorial

Graphing the Four Factors: First Attempt

Visualizing the Four Factors: Going Straight to the Source Code

Creating Shot Charts

Creating an Assist Network

Introduction to Expected Points: Tatum’s Deuce

Jayson Tatum’s 2PFG% this season is 54%. What are his expected points on a 2-point field goal attempt? Answer: \(0.54 \times 2 = 1.08\)
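The same arithmetic as a one-line R helper. The function name `expected_points` is just for illustration; it multiplies the make probability by the value of the shot.

```r
# Expected points on a shot = P(make) * points if made
expected_points <- function(fg_pct, shot_value) fg_pct * shot_value

expected_points(0.54, 2)  # 1.08
```

By the same formula, a 3-point attempt has the same expected value when the shooter makes 36% of them, since 0.36 × 3 = 1.08.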

Question: How Do Expected Points Change Based on Distance from the Basket?

(Conditional) Expected Points for Starters

Cluster Analysis

March Madness

Question: How Many Possible March Madness Brackets are There?

Hint: How many games are there in the tournament (not including the First Four), and how many possible outcomes does each game have? Answer: 63 games with 2 outcomes each gives \(2^{63}\), or 9,223,372,036,854,775,808, possible brackets.
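A quick way to check the count in R, assuming the standard 64-team field in which each round halves the number of teams. The variable names are illustrative.

```r
# Games per round in a 64-team single-elimination bracket
games_per_round <- 2^(5:0)       # 32, 16, 8, 4, 2, 1
n_games <- sum(games_per_round)  # 63

# Two outcomes per game; stored as a double, which is exact for a power of two
n_brackets <- 2^n_games
format(n_brackets, big.mark = ",", scientific = FALSE)
```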

How Brackets are Typically Scored

Round One: 1

Round Two: 2

Sweet Sixteen: 4

Elite Eight: 8

Final Four: 16

Championship: 32

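A perfect bracket scores 192 points: each of the six rounds is worth 32 points in total (for example, 32 first-round games at 1 point per correct pick).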
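A minimal sketch of this scoring scheme in R. The function `score_bracket` and its input format (one character vector of winners per round) are assumptions made for illustration, not part of any package.

```r
# Points per correct pick double each round; games per round halve
points_per_pick <- 2^(0:5)  # 1, 2, 4, 8, 16, 32
games_per_round <- 2^(5:0)  # 32, 16, 8, 4, 2, 1
perfect_score <- sum(points_per_pick * games_per_round)  # 192

# picks and results: lists with one character vector of winners per round
score_bracket <- function(picks, results) {
  stopifnot(length(picks) == length(results))
  pts <- 2^(seq_along(picks) - 1)
  sum(mapply(function(p, r, w) w * sum(p %in% r), picks, results, pts))
}

# Toy 4-team example: one correct first-round pick, wrong champion
picks   <- list(c("A", "C"), "A")
results <- list(c("A", "D"), "D")
score_bracket(picks, results)  # 1
```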

Ranking Teams Using Metrics

This could be as simple as using AP rankings, or you could develop your own metric. You can evaluate your metric based on how it performs on past tournaments.

Example Metrics

Ranking Teams Using Metrics

  • We can pick the best team at each stage
  • A strict hierarchy is unrealistic, and a single predicted bracket carries a lot of uncertainty
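Picking the best team at each stage can be sketched as follows. The function `pick_by_metric`, the team names, and the ratings are hypothetical; the only assumptions are that teams are listed in bracket order and that a higher metric means a better team.

```r
# Advance the higher-rated team in every game until one team remains.
# teams:  character vector in bracket order (length a power of two)
# metric: named numeric vector, higher = better (hypothetical ratings)
pick_by_metric <- function(teams, metric) {
  rounds <- list()
  while (length(teams) > 1) {
    a <- teams[c(TRUE, FALSE)]   # first team in each pairing
    b <- teams[c(FALSE, TRUE)]   # second team in each pairing
    teams <- unname(ifelse(metric[a] >= metric[b], a, b))
    rounds[[length(rounds) + 1]] <- teams
  }
  rounds  # winners of each round, in order
}

teams  <- c("A", "B", "C", "D")
metric <- c(A = 10, B = 7, C = 8, D = 9)
bracket <- pick_by_metric(teams, metric)
bracket
```

Because the output is a single deterministic bracket, it says nothing about how confident each pick is, which is the limitation noted above.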

Predicting Individual Games

We can predict the probability of a team winning a certain March Madness game.

What are some models that can be used for binary classification? Logistic Regression, Decision Trees, Random Forest, SVM, Neural Network, etc.
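As one example from this list, a logistic regression can be fit with base R's `glm`. The data below are simulated purely for illustration, and the single predictor `rating_diff` (team rating minus opponent rating) is a made-up variable, not one taken from a real data set.

```r
set.seed(1)

# Synthetic games: win probability increases with the rating difference
n <- 500
rating_diff <- rnorm(n, mean = 0, sd = 8)
win <- rbinom(n, size = 1, prob = plogis(0.15 * rating_diff))
games <- data.frame(win, rating_diff)

# Logistic regression: log-odds of winning as a linear function of rating_diff
fit <- glm(win ~ rating_diff, data = games, family = binomial)

# Predicted win probability for a team rated 5 points better than its opponent
p_win <- predict(fit, newdata = data.frame(rating_diff = 5), type = "response")
p_win
```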

Question: What are Potential Variables for Our Model?

Strength of Schedule, Performance in Recent Games, Performance in Recent Seasons, Injuries/Suspensions, Location, Player Matchups, Offensive/Defensive Tendencies

A “Probabilistic” Approach

Depending on the choice of model, the team most likely to advance at one stage may not be the team most likely to advance at a later stage.

A “Probabilistic” Approach

  • Can account for differences in playstyles
  • Not practical to compute (9,223,372,036,854,775,808 different possibilities to consider)
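A toy 4-team example makes both points concrete. The pairwise win probabilities below are invented, and the code enumerates every possible bracket, which is only feasible because there are \(2^3 = 8\) of them here rather than \(2^{63}\).

```r
teams <- c("A", "B", "C", "D")

# P[i, j] = probability that team i beats team j (made-up values)
P <- matrix(NA_real_, 4, 4, dimnames = list(teams, teams))
set_p <- function(P, i, j, p) { P[i, j] <- p; P[j, i] <- 1 - p; P }
P <- set_p(P, "A", "B", 0.60)
P <- set_p(P, "C", "D", 0.55)
P <- set_p(P, "A", "C", 0.45)
P <- set_p(P, "A", "D", 0.40)
P <- set_p(P, "B", "C", 0.75)
P <- set_p(P, "B", "D", 0.70)

# Semifinals are A vs B and C vs D; enumerate all 8 brackets
b <- expand.grid(w1 = c("A", "B"), w2 = c("C", "D"),
                 w1_champ = c(TRUE, FALSE), stringsAsFactors = FALSE)
b$l1 <- ifelse(b$w1 == "A", "B", "A")
b$l2 <- ifelse(b$w2 == "C", "D", "C")
b$champion  <- ifelse(b$w1_champ, b$w1, b$w2)
b$runner_up <- ifelse(b$w1_champ, b$w2, b$w1)

# Probability of a bracket = product of its three game probabilities
b$prob <- P[cbind(b$w1, b$l1)] * P[cbind(b$w2, b$l2)] *
  P[cbind(b$champion, b$runner_up)]

title_prob <- tapply(b$prob, b$champion, sum)  # title probability per team
best <- b[which.max(b$prob), ]                 # single most likely bracket
title_prob
best
```

With these made-up numbers, A is favored in its semifinal, yet B has the higher title probability because it matches up better against C and D. The single most likely bracket has C as champion even though B is the most likely champion overall.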

How Can We Model How “Good” a Team is?

We can predict how many points a team will score in a March Madness game.

What kind of distribution can we use to model a count variable? Poisson or Binomial. Poisson is easier because it has only one parameter.

Linear regression assumes a response with support \((-\infty, \infty)\), so we use a link function to map the expected count, which must be positive, onto the real line.

Poisson Regression: Predicting Points Scored

For regression, let \(Y\) be the number of points scored by the team of interest, and let \(x_j\) be the \(j\)th predictor out of \(n\).

Then \(\log(E(Y \mid x)) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n\)
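A minimal sketch of fitting this model with base R's `glm` and a log link. The data are simulated, and the two predictors (`off_rating` and `opp_def`) are hypothetical stand-ins for whatever team-level variables are actually available.

```r
set.seed(2)

# Synthetic data: points scored depend on two made-up efficiency ratings
n <- 400
off_rating <- rnorm(n, mean = 105, sd = 6)  # team's offensive rating
opp_def    <- rnorm(n, mean = 100, sd = 6)  # opponent's defensive rating
lambda_true <- exp(1.5 + 0.02 * off_rating + 0.007 * opp_def)
points <- rpois(n, lambda_true)
scores <- data.frame(points, off_rating, opp_def)

# Poisson regression: log(E(Y | x)) is linear in the predictors
fit_pois <- glm(points ~ off_rating + opp_def, data = scores,
                family = poisson(link = "log"))

# Predicted mean points (lambda) for a new matchup
new_game <- data.frame(off_rating = 110, opp_def = 98)
lambda_hat <- predict(fit_pois, newdata = new_game, type = "response")
lambda_hat
```

Using `type = "response"` applies the inverse link, so the prediction is on the points scale and is always positive.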

A Simulation Approach

For each team in a game, we can draw from Pois\((\lambda)\), where \(\lambda\) is the predicted response from our regression for that team’s points. The team that scores more points advances. We can simulate a tournament as many times as we want. Then we can get an idea of how likely a team is to make it to a certain round. Note that the most likely bracket may not coincide with the most likely winner.
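The procedure described above can be sketched for a 4-team bracket. Two simplifications are assumptions of this sketch rather than part of the method: each team gets a single fixed \(\lambda\) regardless of opponent, and tied scores are redrawn. The function names and \(\lambda\) values are made up.

```r
# Simulate one game; returns TRUE if team A outscores team B.
# Ties are redrawn (an assumption standing in for overtime).
simulate_game <- function(lambda_a, lambda_b) {
  repeat {
    pts <- rpois(2, c(lambda_a, lambda_b))
    if (pts[1] != pts[2]) return(pts[1] > pts[2])
  }
}

# lambda: named vector of predicted points, in bracket order
simulate_tournament <- function(lambda) {
  teams <- names(lambda)
  while (length(teams) > 1) {
    a <- teams[c(TRUE, FALSE)]
    b <- teams[c(FALSE, TRUE)]
    a_wins <- mapply(simulate_game, lambda[a], lambda[b])
    teams <- unname(ifelse(a_wins, a, b))
  }
  teams  # the champion
}

set.seed(3)
lambda <- c(A = 78, B = 70, C = 74, D = 72)  # hypothetical predictions
champs <- replicate(2000, simulate_tournament(lambda))
prop.table(table(champs))  # estimated title probability for each team
```

Recording the winners of every round, not just the champion, would give the estimated probability of each team reaching each stage.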

A Simulation Approach

Pros/Cons of Simulation

  • Doesn’t require as much processing
  • Hard to find optimal bracket

Vibes Bracket

Pick the teams you think will win, or the teams you personally want to win. It worked great for Florida students! Just make sure you don’t pick too few or too many upsets.

Picking Upsets

NCAA Average Number of Upsets:

  • Total Upsets: 8.5
  • First Round: 4.65
  • Second Round: 3.13
  • Elite Eight: 0.31
  • Final Four: 0.10
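To compare a bracket against these averages, upsets can be counted from the seeds of each game's winner and loser. The slide does not state how an upset is defined, so the seed-gap threshold `min_gap` below is an assumption and should be set to match whatever definition the averages use. The function name and sample seeds are made up.

```r
# Count games in which the winner was seeded at least min_gap lines
# worse than the loser. The default threshold is an assumption.
count_upsets <- function(winner_seed, loser_seed, min_gap = 5) {
  sum(winner_seed - loser_seed >= min_gap)
}

# Four hypothetical first-round results
winner_seed <- c(1, 12, 9, 13)
loser_seed  <- c(16, 5, 8, 4)
count_upsets(winner_seed, loser_seed)               # 2
count_upsets(winner_seed, loser_seed, min_gap = 1)  # 3
```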

Conclusion and Further Resources